(57) ABSTRACT

A distributed computer system and method for determining 
cluster membership in a distributed computer system. A 
plurality of computers configurable as cluster nodes are 
coupled through one or more public and/or private commu- 
nications networks. Cluster management software running 
on the plurality of computers is configured to group various 
ones of the computers into a cluster. Weighting values are 
assigned to each node, such as by relative processing power. 
Each fully connected subset of nodes is grouped into a
possible cluster configuration. The weighting value of each 
subset is calculated. The membership in the cluster is chosen 
based on the subset with the optimum weighting value 
among all the possible cluster configurations. The maximum 
weighting value may be adjusted if the maximum weighting 
value is greater than or equal to the sum of all other 
weighting values for all other nodes in the current cluster 
configuration. The maximum weighting factor may be 
adjusted to a value below the sum of all other weighting 
values for all other nodes in the current cluster configuration. 

41 Claims, 5 Drawing Sheets 
SYSTEM AND METHOD FOR DETERMINING CLUSTER MEMBERSHIP IN
A HETEROGENEOUS DISTRIBUTED SYSTEM

PRIORITY DATA

This application is a continuation-in-part of patent application
having Ser. No. 08/955,885, entitled "Determining Cluster
Membership in a Distributed Computer System", whose inventors are
Hossein Moiin, Ronald Widyono, and Ramin Modiri, filed on
Oct. 21, 1997, now U.S. Pat. No. 5,999,712 issued on Dec. 7, 1999.

BACKGROUND OF THE INVENTION

1. Field of the Invention
This invention relates to distributed computer systems, and more
particularly to a system and method for dynamically determining
cluster membership.

2. Description of the Related Art
As databases and other large-scale software systems grow, the
ability of a single computer to handle all of the tasks associated
with the database diminishes. Other concerns, such as failure
handling and the response time under a large volume of concurrent
queries, also increase the number of problems that a single
computer must face when running a database program.

There are two basic ways of handling a large-scale software system.
One way is to have a single computer with multiple processors
running a single operating system as a symmetric multiprocessing
system. The other way is to group a number of computers together to
form a cluster, a distributed computer system that works together
as a single entity to cooperatively provide processing power and
mass storage resources. Clustered computers may be in the same room
together, or separated by great distances. By forming a distributed
computing system into a cluster, the processing load is spread over
more than one computer, eliminating single points of failure that
could cause a single computer to abort execution. Thus, programs
executing on the cluster may ignore a problem with one computer.
While each computer usually runs an independent operating system,
clusters additionally run clustering software that allows the
plurality of computers to process software as a single unit.

Another problem for clusters is how to configure into a cluster or
how to reconfigure the cluster after a failure. Initial
configuration of the cluster is described in related and co-pending
patent application having Ser. No. 08/955,885, entitled
"Determining Cluster Membership in a Distributed Computer System",
whose inventors are Hossein Moiin, Ronald Widyono, and Ramin
Modiri, filed on Oct. 21, 1997, now U.S. Pat. No. 5,999,712 issued
on Dec. 7, 1999. A failure may be hardware and/or software, and the
failure may be in a computer node or in a communications network
linking the computer nodes. A group of computer nodes that is
attempting to reconfigure the cluster will each vote for their
preferred membership list for the cluster. If the alternatives have
configurations that distinctly differ, an elected membership list
for the cluster is often easily determined based on some
arbitrarily set selection criteria. In other cases, a quorum of
votes from the computer nodes, or a centralized decision-maker,
must decide on the cluster membership. A quorum may be defined as
the number of votes that have to be cast for a given cluster
configuration membership list for that cluster configuration to be
selected as the current cluster configuration membership.

One serious situation that must be avoided is the split-brain
condition. A split-brain is where two differing subsets of nodes
each think that they are the cluster and that the members of the
other subset have shut down their clustering software. The
split-brain condition leads to data and file corruption, since the
two subsets each think that they are the cluster with control of
all data and files.

Thus, it can be seen that a primary concern with clusters is how to
determine what configuration is optimum for a given number and
coupling of computers after a failure. Considerations such as how
many of the available computers should be in the cluster and which
computers can freely communicate should be taken into account. It
would thus be desirable to have an optimized way to determine
membership in the cluster after a failure causes a reconfiguration
of the cluster membership.

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by a system
and method for determining cluster membership in a distributed
computer system. In one embodiment, the system comprises a
plurality of computer nodes coupled through one or more
communications networks. These networks may include private and/or
public data networks. Each of the computer nodes executes cluster
management software that helps determine cluster membership in the
distributed computer system. Weighting values assigned to each node
are combined to choose an optimal configuration for the cluster. A
cluster configuration must be determined upon initiation of a new
cluster. Cluster reconfiguration of an existing cluster must also
occur if a node joins or leaves the cluster. The most common reason
for a node to leave the cluster is by failure, either of the node
itself or a communication line coupling the node to the cluster.
Basing cluster membership decisions upon weighting factors assigned
to each computer node may advantageously increase availability and
performance by favoring the most valued (fastest, etc.) nodes in
the cluster when nodes must be failed to prevent split-brain
configurations.

A method is contemplated, in one embodiment, to determine the
membership of nodes in the cluster by assigning a weighting value
to each of the nodes. The weighting value may be based upon various
factors, such as relative processing power of the node, amount of
physical memory, etc. A first subset of the nodes is grouped into a
first possible cluster configuration, while a second subset of the
nodes is grouped into a second possible cluster configuration. The
weighting values of each subset are combined to calculate a first
and a second value for the first and second possible cluster
configurations, respectively. The membership in the cluster is
chosen based on the first and second values. In a further
embodiment, the first and second subsets may be but a start to a
number of subsets of nodes, each grouped into a possible cluster
configuration according to predetermined rules. In this further
embodiment, the weighting values are calculated for each possible
cluster configuration. The membership in the cluster is chosen
based on the weighting values calculated for each possible cluster
configuration. This feature may advantageously result in the
cluster reconfiguring with an optimized configuration. The method
may be implemented in software.

In a further embodiment, the weighting values for the computer
nodes are compared to find a node with the maximum weighting value.
The maximum weighting value may be adjusted if the maximum
weighting value is greater than or equal to the sum of all other
weighting values for all
other nodes in the current cluster configuration. According to one
preferred embodiment, the maximum weighting factor is adjusted to a
value below the sum of all other weighting values for all other
nodes in the current cluster configuration. This feature may
advantageously result in the cluster having an optimized
configuration that is less susceptible to single mode failures.

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention will become apparent
upon reading the following detailed description and upon reference
to the accompanying drawings in which:

FIGS. 1A and 1B are block diagrams of embodiments of typical
distributed computer systems that may be configured as clusters;

FIG. 2 is an embodiment of typical software layers that may be
found in a distributed computer system configured as a cluster;

FIGS. 3A and 3B are block diagrams of embodiments illustrating
possible communications breakdowns of clusters similar to those
shown in FIGS. 1A and 1B;

FIG. 4 is a flowchart illustrating an embodiment of a method for
determining which computer nodes are members of the cluster; and

FIG. 5 is a flowchart illustrating an embodiment of a method for
adjusting the weighting factors of computer nodes that are members
of the cluster.

While the invention is susceptible to various modifications and
alternative forms, specific embodiments thereof are shown by way of
example in the drawings and will herein be described in detail. It
should be understood, however, that the drawings and detailed
description thereto are not intended to limit the invention to the
particular form disclosed, but on the contrary, the intention is to
cover all modifications, equivalents and alternatives falling
within the spirit and scope of the present invention as defined by
the appended claims.

DETAILED DESCRIPTION OF THE INVENTION

Patent application having Ser. No. 08/955,885, entitled
"Determining Cluster Membership in a Distributed Computer System",
whose inventors are Hossein Moiin, Ronald Widyono, and Ramin
Modiri, filed on Oct. 21, 1997, now U.S. Pat. No. 5,999,712 issued
Dec. 7, 1999, from which this application claims priority, is
herein incorporated by reference in its entirety.

Turning to FIGS. 1A and 1B, block diagrams of embodiments of
typical distributed computer systems that may be configured as
clusters are illustrated. Shown in FIG. 1A is a typical N-to-N
topology. Four computers 110A, 110B, 110C, and 110D are coupled
through a communications network 120, which is preferably a public
communications network. Three mass storage devices 130A, 130B, and
130C are available to each of the computers 110 through data
storage communications linkage 125. Mass storage devices 130 may
include and/or exchange data with various storage media, including
mechanical, electrical, magnetic, chemical, and optical devices.
The computers 110 may also be coupled to each other directly
through private interconnects 115.

FIG. 1B illustrates a ring topology coupling four computer nodes
110A-110D. Each mass storage device, 130A-130D, is dual ported and
shared between two computer nodes. As shown, computer node 110A
shares mass storage device 130A with computer node 110B and shares
mass storage device 130D with computer node 110D. Computer node
110C shares mass storage device 130B with computer node 110B and
shares mass storage device 130C with computer node 110D. Each
computer node 110 is coupled to communicate with the other computer
nodes 110 through a communications network 120. Additionally,
private interconnects, not shown, may also be used to couple the
computer nodes 110. Private interconnects are the preferred route
for "keep alive" messages between computer nodes that are members
of the cluster.

It is noted that a variety of other topologies is available for
coupling a plurality of computer nodes in a cluster arrangement. As
examples, in an N+1 topology, each mass storage device is shared
between a primary computer node and a backup computer node. The
backup computer node is thus coupled to all of the mass storage
devices, while each primary computer node carries the primary
processing load of the cluster. In a clustered pair topology, the
mass storage devices are shared between pairs of computer nodes.
The cluster is configured as a plurality of pairs of computer
nodes. Coupling between the computer nodes in the system may also
route through other computer nodes. These types of configurations
are discussed with respect to FIG. 3A.

Turning now to FIG. 2, a block diagram of an embodiment of typical
software layers that may be found in a distributed computer system
configured as a cluster is shown. The five layers illustrated
include the operating system 210, the cluster framework 220, the
application programming interfaces (APIs) 230, the data services
240, and the cluster system management 250. It is noted that other
software configurations are possible and that the software layers
and interrelationships shown are exemplary only. Some or all of the
operations of the software may be carried out in firmware or
hardware.

The base software layer is the operating system 210. The operating
system 210 is preferably a variant of UNIX, such as SOLARIS 2.5.1,
available from Sun Microsystems, Inc. of Palo Alto, Calif. Other
implementations may use other operating systems such as OpenVMS,
available from Digital Equipment Corp. of Maynard, Mass., or
WINDOWS NT, available from Microsoft Corp. of Redmond, Wash., as
desired. Preferable properties of the operating system include full
support for symmetric multithreading and multiprocessing,
flexibility, availability, and compatibility to support
enterprise-wide computing, including the cluster. The operating
system 210 and related software preferably provides networking
protocols, stacks, and sockets, as well as security for the
cluster.

Cluster framework 220 runs on top of the operating system 210. The
cluster framework includes the fault management components, which
provide fault detection and recovery, failover, and dynamic cluster
reconfiguration. Cluster connectivity module 222 monitors
communications between each of the computer nodes in the cluster.
Typically, a computer node in the cluster sends a "1" as a "keep
alive" packet, either to every other computer node with which it is
in communication or just to its nearest neighbors, to indicate its
presence in the cluster. Cluster membership and quorum and
reconfiguration 224 maintains the proposed membership lists and the
elected membership list and provides configuration and
reconfiguration decision making. Switching and failover 226 detects
problems and maintains the data and communications integrity of the
cluster when failures in hardware or software occur.
Reconfiguration upon detection of a failure typically is completed
in a matter of minutes. Failover preferably includes cascaded
failovers of a com-
puter node in the cluster to multiple, redundant backup computer
nodes, as well as file-lock migrations to avoid file corruption.

The application programming interfaces 230 are preferably designed
to integrate commercially available and custom high availability
applications into the cluster environment. Examples of APIs 230
contemplated include a data service API and a fault monitoring API.
The data service API is configured to allow generic applications to
be failed over to another computer node in the event of a monitored
failure. Control over programs to be automatically started,
stopped, or restarted is typically done by scripts or C language
programs called through the data service API. The fault monitoring
API is configured to allow for custom application-specific fault
monitoring. An application can thus be monitored, started, stopped,
restarted, or failed over to another computer node when a failure
is detected. It is contemplated that various APIs 230, including
APIs not specifically mentioned above, may be used in the system
alone or concurrently, as desired.

Data service modules 240 are layered on top of the cluster
framework 220 and are specific to certain data service applications
such as parallel databases 242 and high availability services 244.
Examples of parallel databases include those from Oracle Corp. of
Redwood Shores, Calif., and INFORMIX of Menlo Park, Calif. Typical
high availability services 244 that may be monitored for faults
include network file services (NFS), databases, and Internet
services such as domain name, mail, news, and web servers. It is
contemplated that multiple independent instances of high
availability services 244 may be run concurrently on the cluster.

Cluster system management includes a control panel 252, a cluster
console interface 254, an on-line fault monitor 256, and storage
control 258. The control panel 252 is preferably a graphical user
interface (GUI)-based administration tool. The cluster console
interface 254 is preferably a remote access GUI-based console
configured to provide convenient centralized access to all computer
nodes in the cluster. The on-line fault monitor 256 is preferably a
GUI-based monitor that allows an administrator of the cluster a
visual color-coded representation of the state of the entire
distributed computer system that includes the cluster. Preferably
the on-line fault monitor 256 integrates with the Solstice SyMON
system monitor available bundled with the SOLARIS operating system
210 to allow for integrated hardware monitoring of individual
computer nodes in the cluster. Storage control 258 preferably
includes either enterprise-wide or cluster-wide disk management and
logical volume configuration. RAID and backup software are
typically managed through the storage control module 258.

Turning now to FIGS. 3A and 3B, block diagrams of embodiments
showing possible communications breakdowns of clusters similar to
those shown in FIGS. 1A and 1B are illustrated. In FIG. 3A, a
cluster including six computer nodes 310A-310F is shown. Computer
nodes 310A and 310B are coupled through communications network
320A. Computer nodes 310A and 310C are coupled through
communications network 320B. Computer node 310C is coupled to
computer node 310D through communications network 320C and to
computer node 310E through communications network 320D. Computer
nodes 310E and 310F are coupled through communications network
320E.

As shown, data communications between computer nodes 310A and 310C
have failed, separating the cluster into two groupings of computer
nodes 310: subset 330A includes computer nodes 310A and 310B, while
subset 330B includes computer nodes 310C, 310D, 310E, and 310F. It
is desirable that the applications executing on the cluster
continue running without corruption of files and data. The cluster
system software, described above, preferably reconfigures the
cluster to an optimum configuration based on the currently
available choices, subset 330A and subset 330B.

In FIG. 3B, a cluster including three computer nodes 310L-310N is
shown. Computer nodes 310L and 310M are coupled through
communications network 320L, while computer nodes 310M and 310N are
coupled through communications network 320M. The three computer
nodes are coupled through a communications network 325L to mass
storage unit 330L.

As shown, data communications between computer nodes 310L and 310M
have failed, separating the cluster into two groupings of computer
nodes 310: subset 330L includes computer node 310L, while subset
330M includes computer nodes 310M and 310N. It is desirable that
the applications running on the cluster continue running without
corruption of files and data. The cluster system software,
described above, preferably reconfigures the cluster to an optimum
configuration based on the currently available choices, subset 330L
and subset 330M.

Turning now to FIGS. 4 and 5, flowcharts illustrating embodiments
of a method for determining which computer nodes are members of the
cluster and a method for adjusting the weighting factors of the
nodes in the cluster are shown. The cluster system software
overviewed in FIG. 2 determines the membership list for the cluster
based on data including communications availability among the
computer nodes. Typically, the software modules in layers 220 and
250 are responsible for determining the membership in the cluster.

First, define a new cluster instance each time there is a change to
the cluster membership. Let C(i) be the set of computer nodes that
are the members of the ith instance of the cluster. For example, if
nodes 0, 1, 2, and 4 are in the third instance of the cluster,
C(3)={0, 1, 2, 4}. If node 3 then joins the cluster,
C(4)={0, 1, 2, 3, 4}. If node 1 now leaves the cluster,
C(5)={0, 2, 3, 4}. Note that i increases with each change of
membership in the cluster. The special case of C(0)={ } is the
empty cluster before first formation of the cluster.

For this embodiment, let us assume that the weights are
non-negative integers. Now let s(j) be the static weight for node
j. The static weight s(j) is preferably a constant that is set by
configuration, although other methods and times for setting the
static weight are contemplated. Let w(i,j) be the dynamic weight
for node j in cluster instance i. In one embodiment, w(i,j)=0 if
node j is not a member of cluster instance i. In this embodiment,
w(i,j)=s(j) if there exists in the cluster instance i a node k such
that s(k)>s(j), that is, another node k in the cluster instance
already has a static weight greater than s(j). The dynamic weight
w(i,j)=s(j) in the additional case that s(j) is less than the sum
of the static weights of all other nodes in cluster instance i. If
s(j) is greater than or equal to the sum of the static weights of
all other nodes in cluster instance i, then w(i,j) is one less than
that sum.

According to one embodiment, as new nodes join the cluster, the
dynamic weight of the highest valued node, which may have been
previously reduced, may go back up. The dynamic weight of the
highest valued node will go up, in this embodiment, if possible.

In another embodiment, the dynamic weights w(i,j) are determined
only after the cluster membership is known, that
is, C(i>0) is known. This implies that when determining
membership in a cluster instance C(i), the dynamic weights
w(i-1,j) are used. Special rules for the special case of C(0)
are given in related patent application having Ser. No.
08/955,885, entitled "Determining Cluster Membership in a
Distributed Computer System", whose inventors are
Hossein Moiin, Ronald Widyono, and Ramin Modiri, filed
on Oct. 21, 1997, now U.S. Pat. No. 5,999,712 issued Dec.
7, 1999.
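
The case analysis for the dynamic weight w(i,j) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `dynamic_weights` and the dictionary representation are hypothetical, "the sum of all other nodes" is read as the sum of their static weights, and the clamp to zero for a one-node cluster (where one less than an empty sum would be negative) is an assumption made to keep the weights non-negative.

```python
def dynamic_weights(static, members):
    """Return w(i, j) for every node j, given static weights s(j) and C(i).

    static  -- dict mapping node id to its static weight s(j)
    members -- set of node ids in cluster instance i (all present in static)
    """
    weights = {}
    for j, s_j in static.items():
        if j not in members:
            weights[j] = 0  # nodes outside C(i) carry no weight
            continue
        others = [static[k] for k in members if k != j]
        if any(s_k > s_j for s_k in others) or s_j < sum(others):
            weights[j] = s_j  # node j does not dominate: keep s(j)
        else:
            # s(j) >= sum of the others: one less than that sum.
            # Clamping at zero is an assumption for the one-node case.
            weights[j] = max(sum(others) - 1, 0)
    return weights
```

With static weights of ten, five, and three and all three nodes in the cluster, the sketch lowers the heaviest node to seven and leaves the other two unchanged.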

To form a new instance of the cluster C(i+1), let W(i) be
the sum of w(i,j) for all j. To form a new instance of the
cluster, there must exist at least one subset M such that the
sum of w(i,k), for all k that are members of the proposed
subset M, is greater than or equal to [W(i)+1]/2. If no subset
satisfies this rule, the entire cluster goes down. It is noted
that this rule is analogous to the [N+1]/2 rule, where N is the
number of nodes in the current cluster with each node having
implied weight of 1. It is also noted that this rule can be
restated in terms of N+1, N-1, Ceiling, and/or Floor functions.
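
A minimal sketch of the [W(i)+1]/2 test follows. The name `has_quorum` and the dictionary input are hypothetical, and the comparison is written as 2 * sum >= W(i) + 1 so that no rounding choice (Ceiling or Floor) has to be made; that is one reading of the rule, not the only one.

```python
def has_quorum(weights, subset):
    """True if subset M may form instance C(i+1) under the [W(i)+1]/2 rule.

    weights -- dict mapping node id to its dynamic weight w(i, j)
    subset  -- iterable of node ids proposed as M
    """
    total = sum(weights.values())  # W(i)
    subset_weight = sum(weights.get(k, 0) for k in subset)
    # subset_weight >= (W(i) + 1) / 2, compared without division
    return 2 * subset_weight >= total + 1
```

Because two disjoint subsets cannot both reach more than half of W(i), at most one side of a partition passes this test.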

The embodiment of the method 400 shown in FIG. 4
comprises the following. A weighting value is assigned to
each computer that may act as a node in the cluster 410. The
weighting value may be indicative, for example, of the
relative processing power of the computer. The method also
determines with which other computer nodes each computer
node is in communication 420. It is noted that communication
for cluster purposes may include limitations such as a
maximum response time to a request. Thus, two computers
may be able to communicate data and still not qualify as in
communication for the purposes of forming or reconfiguring
a cluster. In one embodiment, the cluster membership module
224 in a computer node sends its weighting value to
every other computer node with which it is in communication
to indicate its presence in the cluster, as opposed to just
a "1".

In 430, each node broadcasts to all other nodes the
communication data determined in 420. In 440, each node
then receives the communication data determined in 420 and
broadcast in 430. There is no loss of generality if one or
more computer nodes do not receive the cluster communication
data. Those nodes will simply be left out of the cluster.
In 450, each computer node determines a proposed membership
list for the cluster based on the cluster communication
data received in 440. The computer nodes exchange
proposed membership lists in 460. The preferred membership
list is chosen from among the proposed membership
lists in 470.
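
One way step 450 might derive candidate memberships from the communication data is sketched below. It assumes that nodes able to reach each other through intermediate nodes count as one grouping (as subset 330B in FIG. 3A suggests) and that the link table is symmetric; the function name `connected_groups` and the table format are hypothetical.

```python
def connected_groups(links):
    """Group nodes that can reach one another, directly or through peers.

    links -- dict mapping node id to the set of node ids it reports
             being in communication with (assumed symmetric)
    """
    seen = set()
    groups = []
    for start in sorted(links):
        if start in seen:
            continue
        group, frontier = {start}, [start]
        while frontier:  # simple graph traversal from the start node
            node = frontier.pop()
            for peer in links.get(node, ()):
                if peer not in group:
                    group.add(peer)
                    frontier.append(peer)
        seen |= group
        groups.append(group)
    return groups
```

Applied to the FIG. 3A links after the failure between 310A and 310C, the sketch yields the two groupings described in the text.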

The embodiment of the method 500 shown in FIG. 5
comprises the following. The method finds the node with the
maximum weighting factor 510. The maximum weighting
factor may be adjusted if the maximum weighting factor is
greater than a predetermined function of the weighting
factors of the other computer nodes 520. For example, in one
embodiment, the function may be simple addition. In this
embodiment, the maximum weighting factor is compared to
the sum of the weighting factors of all of the other computer
nodes that are currently in the cluster. Other functions are
similarly contemplated, including multiplication and even
more complex comparison techniques, such as the use of
logarithms. Subsets of computer nodes are then grouped into
two or more possible cluster configurations 530.

Each possible cluster configuration has a configuration
value calculated from the weighting values of the computer
nodes in that possible cluster configuration 540. The function
used to make the calculation is chosen as desired. A
preferred function is simple addition. The method chooses in
550 the cluster configuration based on the configuration
values calculated in 540 for each possible cluster configuration.
In one embodiment, the comparison between the
configuration values is made to find the configuration value
with the maximum value. It is noted that the weighting
values used in calculating the configuration values may
include a dynamically modified weighting value for the
computer node with the maximum weighting value.
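
Steps 540 and 550 can be sketched as below, with the combining function left pluggable because the text says it is chosen as desired. The name `choose_configuration`, its parameters, and the tie-breaking behaviour (the first candidate listed wins) are assumptions for illustration.

```python
def choose_configuration(weights, candidates, combine=sum):
    """Pick the candidate subset with the largest configuration value.

    weights    -- dict mapping node id to its (possibly adjusted) weight
    candidates -- list of node-id collections, one per possible configuration
    combine    -- function reducing a list of weights to one value (540);
                  simple addition by default
    Returns the chosen candidate and the list of all configuration values.
    """
    values = [combine([weights[n] for n in subset]) for subset in candidates]
    # 550: take the maximum value; on a tie the earliest candidate is kept
    best = max(range(len(candidates)), key=values.__getitem__)
    return candidates[best], values
```

Passing a different `combine` (for example `math.prod`) swaps in another of the contemplated functions without changing the selection step.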

In one embodiment, the computer nodes that are attempt- 
ing to reconfigure a cluster will each vote for their preferred 
membership list for the cluster. If the alternatives have 
configuration values that are different, an elected member- 
ship list for the cluster is often easily determined based on 
the selection criteria. In all other cases, a quorum of votes 
from the computer nodes or a centralized decision-maker
must decide on the cluster membership. A quorum may be 
defined as the number of votes that have to be cast for a 
given cluster configuration membership list for that cluster 
configuration to be selected as the current cluster configu- 
ration membership. Since the split-brain condition, where 
two subsets of nodes each think that they are the cluster, 
must be avoided to avoid data and file corruption, quorum is 
preferably the majority of votes that can be cast by the nodes 
already in the cluster before reconfiguration. Each computer 
node may get one vote or the number of votes equal to its 
weighting factor. 
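
The voting step can be illustrated with the following sketch, which tallies proposals and accepts one only if it holds a strict majority of the votes that the nodes of the previous cluster could cast. The name `elect_membership`, the `weighted` flag, and the data layout are hypothetical.

```python
from collections import Counter


def elect_membership(votes, weights, weighted=True):
    """Return the membership list holding a majority of votes, or None.

    votes    -- dict mapping voting node id to its proposed membership list
    weights  -- dict mapping every node of the previous cluster to its
                weighting factor
    weighted -- if True each node casts as many votes as its weight,
                otherwise each node casts one vote
    """
    votes_per_node = {n: (w if weighted else 1) for n, w in weights.items()}
    total = sum(votes_per_node.values())  # all votes that can be cast
    tally = Counter()
    for node, proposal in votes.items():
        tally[frozenset(proposal)] += votes_per_node[node]
    for proposal, count in tally.items():
        if 2 * count > total:  # strict majority, so at most one winner
            return set(proposal)
    return None
```

Nodes that have failed or are unreachable simply do not appear in `votes`, but their weight still counts toward the total, which is what prevents two partitions from both reaching quorum.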

It is noted that in a preferred embodiment, grouping 
subsets of computer nodes into a plurality of possible cluster 
configurations 530 and finding the node with the maximum 
weighting factor 510, both in FIG. 5, are subsets of deter- 
mining proposed membership lists for the cluster in 450 of 
FIG. 4. It is also noted that the flowcharts of FIGS. 4 and 5 
are exemplary only, and portions of FIGS. 4 and 5 may occur 
in different orders. For example, the method may perform 
510 and 520 after 550. In other words, the method finds the 
node with the maximum weighting factor 510 and adjusts 
the maximum weighting factor if needed, after choosing the 
cluster configuration 550. 

Exemplary applications of embodiments of the method 
described above may be made with reference to FIGS. 3A 
and 3B. In FIG. 3A, assume that in the two groupings of 
computer nodes 310, subset 330A computer nodes 310A 
and 310B have weighting values of ten and three, 
respectively, while subset 330B computer nodes 310C, 
310D, 310E, and 310F each have values of one. 

From straight addition of the weighting values of the 
subsets, subset 330A has a configuration value of thirteen, 
while subset 330B has a configuration value of four. If the 
maximum configuration value is the selection criterion, then 
subset 330A will become the reconfigured cluster. 

If dynamic weighting of the maximum weighting factor is 
used, the determination may change. The other computer 
node weighting values sum to seven. Thus, in one 
embodiment, the maximum weighting value is dynamically 
lowered to six from ten. Subset 330A thus has a configura- 
tion value of nine, while subset 330B has a configuration 
value of four. If the maximum configuration value is the 
selection criterion, then subset 330A will again become the 
reconfigured cluster. 

In FIG. 3B, assume that the computer nodes 310L, 310M,
and 310N have respective weighting factors of ten, five, and
three. Subset 330L includes only computer node 310L and
has a configuration value under straight addition of ten.
Subset 330M includes computer nodes 310M and 310N and
has a configuration value under straight addition of eight. If
the maximum configuration value is the selection criterion,
then subset 330L will become the reconfigured cluster. This
reconfigured cluster has a known single point of failure. If
computer 310L fails, the cluster fails completely. Dynamic
revaluation of the maximum weighting factor may avoid this
known single point of failure. The sum total of all other
weighting factors is eight. It is noted that the failure of
computer node 310L is equivalent to the failure of communications
network 320L. In other words, computer node
310L may be functionally operational (i.e. "healthy"), but
lose communications (such as through communications network
320L) with other nodes 310.

In the embodiment where the maximum weighting factor
is adjusted to a value less than the sum total of all other
weighting factors, the weighting factor of computer node
310L is dynamically lowered to seven. Now, subset 330L
has a configuration value of seven, while subset 330M has
a configuration value of eight. If the maximum configuration
value is the selection criterion, then subset 330M will
become the reconfigured cluster, avoiding the single point of
failure configuration.
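
The arithmetic of the FIG. 3A and FIG. 3B examples can be checked with a short script. The helper names `cap_heaviest` and `configuration_values` are hypothetical, straight addition is used as the combining function, and the adjustment is the "one less than the sum of the others" variant; the clamp at zero is an assumption that the examples do not exercise.

```python
def cap_heaviest(weights):
    """Steps 510/520: lower the largest weight to one below the sum of the rest."""
    adjusted = dict(weights)
    top = max(adjusted, key=adjusted.get)  # 510: node with the maximum weight
    rest = sum(w for n, w in adjusted.items() if n != top)
    if adjusted[top] >= rest:  # 520: adjust only if it dominates
        adjusted[top] = max(rest - 1, 0)  # clamp at zero is an assumption
    return adjusted


def configuration_values(weights, subsets):
    """Step 540 with straight addition: one value per named subset."""
    return {name: sum(weights[n] for n in nodes) for name, nodes in subsets.items()}


fig_3a = {"310A": 10, "310B": 3, "310C": 1, "310D": 1, "310E": 1, "310F": 1}
subsets_3a = {"330A": ["310A", "310B"], "330B": ["310C", "310D", "310E", "310F"]}
fig_3b = {"310L": 10, "310M": 5, "310N": 3}
subsets_3b = {"330L": ["310L"], "330M": ["310M", "310N"]}

print(configuration_values(fig_3a, subsets_3a))                # {'330A': 13, '330B': 4}
print(configuration_values(cap_heaviest(fig_3a), subsets_3a))  # {'330A': 9, '330B': 4}
print(configuration_values(fig_3b, subsets_3b))                # {'330L': 10, '330M': 8}
print(configuration_values(cap_heaviest(fig_3b), subsets_3b))  # {'330L': 7, '330M': 8}
```

In FIG. 3A the adjustment leaves the outcome unchanged, while in FIG. 3B it moves the maximum from subset 330L to subset 330M.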

It is noted that in the above-described embodiments,
specific calculations and criteria are illustrated. These specific
calculations and criteria may vary in other embodiments.
It is also noted that in a two-node cluster, the
remaining node will reconfigure as a one-node cluster
instead of shutting the cluster down upon a failure leading to
the second node leaving the cluster. While the above
embodiments of the method principally describe software,
the method may also be implemented as firmware or in
hardware, as desired.

Numerous variations and modifications will become 
apparent to those skilled in the art once the above disclosure 
is fully appreciated. It is intended that the following claims 
be interpreted to embrace all such variations and modifications. 

What is claimed is: 

1. A method for determining membership of nodes in a 
cluster in a distributed computer system, the method comprising: 

assigning a weighting value to each of the nodes; 

grouping a first subset of said nodes into a first possible 
cluster configuration; 

grouping a second subset of said nodes into a second 
possible cluster configuration; 

combining the weighting values of the first subset of said 
nodes to calculate a first value; 

combining the weighting values of the second subset of 
said nodes to calculate a second value; and 

choosing either said first subset or said second subset for 
membership in said cluster depending upon a result of 
said first value of said first possible cluster configuration 
and said second value of said second possible 
cluster configuration. 

2. The method of claim 1, wherein the weighting value 
assigned to a respective node is indicative of the relative 
processing power of said respective node. 

3. The method of claim 1, wherein said grouping a first 
subset of said nodes into a first possible cluster configuration 
and said grouping a second subset of said nodes into a 
second possible cluster configuration include: 

determining with which other nodes each node is in 
communication; 

combining the weighting factors of various subsets of 
nodes which are all in communication; and 

choosing from said various subsets of nodes. 

4. The method of claim 3, further comprising: 

comparing the weighting values assigned to each of the 
nodes in the current cluster configuration to find a node 
with a maximum weighting factor, wherein the maximum 
weighting factor is adjusted if the maximum 
weighting factor is greater than or equal to the sum of 
all other nodes in the current cluster configuration. 

5. The method of claim 4, wherein the maximum weight- 
ing factor is adjusted to a value less than said sum of said all 
other nodes. 

6. A method for determining membership of nodes in a 
cluster in a distributed computer system, the method com- 
prising: 

assigning a weighting value to each of the nodes; 

determining with which other nodes each node is in 
communication; 

determining alternatives for a proposed membership list 
for each node based on said determining with which 
nodes each node is in communication and the weight- 
ing values assigned to each of the nodes; 

adding the weighting factors of nodes involved in each 
alternative for the proposed membership list for each 
node to arrive at a sum for each alternative for the 
proposed membership list for each node; 

choosing a preferred alternative for the proposed mem- 
bership list from the alternatives for the proposed 
membership list, wherein the preferred alternative for 
the proposed membership list has the sum that is a 
maximum value. 
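
The steps recited in claim 6 can be illustrated with a minimal brute-force sketch. It assumes that connectivity is symmetric and that an "alternative" is any subset of nodes that are all in communication with one another; the node names, weights, links, and function names are hypothetical. Tie-breaking between equal sums is not specified in the text and is left to the order of enumeration here.

```python
from itertools import combinations


def fully_connected(subset, links):
    """True if every pair of nodes in the subset is in communication."""
    return all(frozenset(pair) in links for pair in combinations(subset, 2))


def alternatives(nodes, links):
    """Yield every fully connected subset (exponential; fine for small clusters)."""
    for size in range(1, len(nodes) + 1):
        for subset in combinations(sorted(nodes), size):
            if fully_connected(subset, links):
                yield subset


def choose_membership(weights, links):
    """Return the alternative whose summed weighting values are a maximum."""
    return max(alternatives(weights, links),
               key=lambda subset: sum(weights[n] for n in subset))


# Hypothetical four-node system: d can only reach c.
weights = {"a": 4, "b": 3, "c": 2, "d": 5}
links = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]}

print(choose_membership(weights, links))  # ('a', 'b', 'c'), sum nine versus seven for (c, d)
```
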

7. The method of claim 6, further comprising: 

comparing the weighting values assigned to each of the 
nodes in the current cluster configuration to find a node 
with a maximum weighting factor, wherein the maxi- 
mum weighting factor is adjusted if the maximum 
weighting factor is greater than or equal to the sum of 
all other nodes in the current cluster configuration. 

8. The method of claim 7, wherein the maximum weight- 
ing factor is adjusted to a value less than the sum of all other 
nodes. 

9. A method for determining membership of nodes in a 
distributed computer system, the method comprising: 

assigning a weighting value to each of the nodes; 

determining with which other nodes a selected node is in 
communication; 

broadcasting to the other nodes communication data 
specifying the other nodes with which the selected node 
is in communication; 

receiving the communication data specifying the other 
nodes with which the selected node is in communica- 
tion; 

determining a proposed membership list based on the 
communication data specifying the other nodes with 
which the selected node is in communication and the 
weighting values assigned to each of the nodes; 

broadcasting the proposed membership list to the other 
nodes with which the selected node is in communica- 
tion; 

receiving the proposed membership lists from each of the 
other nodes with which the selected node is in com- 
munication; and 

determining an elected membership list from the proposed 
membership lists. 
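
The exchange recited in claim 9 can be simulated in a single process as a rough sketch. Several points are assumptions rather than anything the claim states: connectivity is symmetric, each node proposes the heaviest fully connected group that contains itself, and the elected list is the received proposal with the largest summed weight. All names are hypothetical.

```python
from itertools import combinations


def propose(node, views, weights):
    """Heaviest fully connected group containing `node`, judged from received views."""
    known = sorted(views)  # nodes whose connectivity data this node has received
    best, best_sum = (node,), weights[node]
    for size in range(2, len(known) + 1):
        for group in combinations(known, size):
            if node not in group:
                continue
            connected = all(b in views[a] and a in views[b]
                            for a, b in combinations(group, 2))
            total = sum(weights[n] for n in group)
            if connected and total > best_sum:
                best, best_sum = group, total
    return best


def elect(proposals, weights):
    """Assumed election rule: the proposal with the maximum summed weight."""
    return max(proposals, key=lambda group: sum(weights[n] for n in group))


def run_round(reachable, weights):
    # Each node receives the connectivity data broadcast by the nodes it can hear.
    views = {n: {p: reachable[p] for p in peers | {n}}
             for n, peers in reachable.items()}
    # Each node determines and broadcasts its proposed membership list.
    proposals = {n: propose(n, views[n], weights) for n in reachable}
    # Each node elects a list from its own proposal and those it received.
    return {n: elect([proposals[p] for p in sorted(peers | {n})], weights)
            for n, peers in reachable.items()}


# Hypothetical three-node system in which the a-c link has failed.
weights = {"a": 4, "b": 3, "c": 2}
reachable = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

print(run_round(reachable, weights))
```

In this example every node, including c, arrives at the list (a, b); c finds itself excluded from the elected list.
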

10. The method of claim 9, wherein the weighting value 
assigned is indicative of the relative processing power of 
each node. 

11. The method of claim 9, wherein determining a proposed 
membership list based on the communication data 
specifying the other nodes with which the selected node is 
in communication and the weighting values assigned to each 
of the nodes includes: 

adding the weighting factors of nodes involved in various 
groupings for the proposed membership list; and 

choosing a grouping for the proposed membership list that 
has a maximum sum obtained from said adding the 
weighting factors of nodes involved in various groupings 
for the proposed membership list. 

12. The method of claim 9, further comprising: 

comparing the weighting values assigned to each of the 
nodes in the elected membership list to find a node with 
a maximum weighting factor, wherein the maximum 
weighting factor is adjusted if the maximum weighting 
factor is greater than or equal to the sum of all other 
nodes in the elected membership list. 

13. The method of claim 12, wherein the maximum 
weighting factor is adjusted downward to a value less than 
the sum of all other nodes. 

14. A distributed computer system, comprising: 

one or more communications networks; 

a plurality of computers each configurable as a cluster 
node, wherein the plurality of computers are coupled to 
the one or more communications networks, wherein 
each of the plurality of computers is assigned a weighting 
value; and 

cluster management software running on the plurality of 
computers, wherein said cluster management software 
establishes cluster membership, wherein the clustering 
software is configured to: 

assign a weighting value to each of various ones of the 
plurality of computers; 

group a first subset of said various ones into a first 
possible cluster configuration; 

group a second subset of said various ones into a 
second possible cluster configuration; 

combine the weighting values of the first subset to 
calculate a first value; 

combine the weighting values of the second subset to 
calculate a second value; and 

choose either said first subset or said second subset for 
membership in said cluster depending upon a result 
of said first value of said first possible cluster configuration 
and said second value of said second 
possible cluster configuration. 

15. The distributed computer system of claim 14, wherein 
said one or more communications networks include one or 
more public communications networks. 

16. The distributed computer system of claim 14, further 
comprising: 

a private interconnect configured to further couple 
together various ones of the plurality of computers. 

17. The distributed computer system of claim 16, wherein 
the private interconnect is further configured to exchange 
cluster configuration data among the various ones of the 
plurality of computers. 

18. The distributed computer system of claim 17, wherein 
the private interconnect is further configured to exchange 
database traffic among the various ones of the plurality of 
computers. 

19. The distributed computer system of claim 14, wherein 
each of the plurality of computers includes at least one 
network interface card configured to couple to the communications 
network. 

20. The distributed computer system of claim 14, further 
comprising: 

one or more mass storage devices coupled in the distributed 
computer system, wherein the plurality of computers 
are configured to access the one or more mass 
storage devices. 

21. The distributed computer system of claim 20, wherein 
each of the one or more mass storage devices is coupled to 
one or more of the plurality of computers. 

22. The distributed computer system of claim 14, wherein 
the cluster management software is further configured to: 

add the weighting factors of said various ones included in 
said first subset; 

add the weighting factors of said various ones included in 
said second subset; and 

choose either said first subset or said second subset for 
membership in said cluster depending upon a maxi- 
mum sum obtained from said add the weighting factors 
of said various ones included in said first subset and 
said add the weighting factors of said various ones 
included in said second subset. 

23. The distributed computer system of claim 22, wherein 
the cluster management software is further configured to: 

compare the weighting values assigned to each of the 
nodes in the membership of the cluster to find a node 
with a maximum weighting factor, wherein the maxi- 
mum weighting factor is adjusted if the maximum 
weighting factor is greater than or equal to the sum of 
all other nodes in the membership in the cluster. 

24. The distributed computer system of claim 23, wherein 
the maximum weighting factor is adjusted downward to a 
value less than the sum of all other nodes in the membership 
of the cluster. 

25. A distributed computer system, comprising: 
one or more communications networks; 

a plurality of computers each configurable as a cluster 
node, wherein the plurality of computers are coupled to 
the one or more communications networks, wherein 
each of the plurality of computers is assigned a weight- 
ing value; and 

cluster management software running on the plurality of 
computers to configure various ones of the plurality of 
computers into a cluster, wherein the clustering soft- 
ware is configured to: 

determine with which other nodes a selected node is in 
communication; 

broadcast to the other nodes communication data speci- 
fying the other nodes with which the selected node is 
in communication; 

receive the communication data specifying the other 
nodes with which the selected node is in communi- 
cation; 

determine a proposed membership list based on the 
communication data specifying the other nodes with 
which the selected node is in communication and the 
weighting values assigned to each of the nodes; 

broadcast the proposed membership list to the other 
nodes with which the selected node is in communi- 
cation; 

receive the proposed membership lists from each of the 
other nodes with which the selected node is in 
communication; and 

determine an elected membership list from the pro- 
posed membership lists. 

26. The distributed computer system of claim 25, wherein 
the cluster management software is further configured to: 

add the weighting factors of nodes involved in various 
groupings for the proposed membership list; and 




choose a grouping for the proposed membership list that 
has a maximum sum obtained from said adding the 
weighting factors of nodes involved in various group- 
ings for the proposed membership list. 

27. The distributed computer system of claim 26, wherein 
the cluster management software is further configured to: 

compare the weighting values assigned to each of the 
nodes in the elected membership list to find a node with 
a maximum weighting factor, wherein the maximum 
weighting factor is adjusted if the maximum weighting 
factor is greater than or equal to the sum of all other 
nodes in the elected membership list. 

28. The method of claim 27, wherein the maximum 
weighting factor is adjusted downward to a value less than 
the sum of all other nodes. 

29. A distributed computer system that determines which 
nodes are members of a cluster, comprising: 

means for assigning a weighting value to each of the 
nodes; 

means for grouping a first subset of said nodes into a first 
possible cluster configuration; 

means for grouping a second subset of said nodes into a 
second possible cluster configuration; 

means for combining the weighting values of the first 
subset of said nodes to calculate a first value; 

means for combining the weighting values of the second 
subset of said nodes to calculate a second value; and 

means for choosing either said first subset or said second 
subset for membership in said cluster depending upon 
a result of said first value of said first possible cluster 
configuration and said second value of said second 
possible cluster configuration. 

30. The distributed computer system of claim 29, further 
comprising: 

means for comparing the weighting values assigned to 
each of the nodes in the membership of the cluster to 
find a node with a maximum weighting factor, and 

means for adjusting the weighting value of the node with 
the maximum weighting factor if the maximum weight- 
ing factor is greater than or equal to the sum of the 
weighting values of all other nodes in the membership 
in the cluster. 

31. The distributed computer system of claim 30, wherein 
the maximum weighting factor is adjusted to a value less 
than said sum of said all other nodes. 

32. The distributed computer system of claim 29, wherein 
the weighting value assigned to a respective node is indica- 
tive of the relative processing power of said respective node. 

33. A distributed computer system that determines which 
nodes are members of a cluster, comprising: 

means for assigning a weighting value to each of the 
nodes; 

means for determining with which other nodes a selected 
node is in communication; 

means for broadcasting to the other nodes communication 
data specifying the other nodes with which the selected 
node is in communication; 

means for receiving the communication data specifying 
the other nodes with which the selected node is in 
communication; 

means for determining a proposed membership list based 
on the communication data specifying the other nodes 
with which the selected node is in communication and 
the weighting values assigned to each of the nodes; 

means for broadcasting the proposed membership list to 
the other nodes with which the selected node is in 
communication; 

means for receiving the proposed membership lists from 
each of the other nodes with which the selected node is 
in communication; and 

means for determining an elected membership list from 
the proposed membership lists. 

34. The distributed computer system of claim 33, wherein 
the weighting value assigned is indicative of the relative 
processing power of each node. 

35. The distributed computer system of claim 33, wherein 
said means for determining a proposed membership list 
based on the communication data specifying the other nodes 
with which the selected node is in communication and the 
weighting values assigned to each of the nodes includes: 

means for adding the weighting factors of nodes involved 
in various groupings for the proposed membership list; 
and 

means for choosing a grouping for the proposed membership 
list that has a maximum sum obtained from said 
adding the weighting factors of nodes involved in 
various groupings for the proposed membership list. 

36. The distributed computer system of claim 33, further 
comprising: 

means for comparing the weighting values assigned to 
each of the nodes in the elected membership list to find 
a node with a maximum weighting factor, and 

means for adjusting the weighting value of the node with 
the maximum weighting factor if the maximum weighting 
factor is greater than or equal to the sum of the 
weighting values of all other nodes in the elected 
membership list. 

37. The method of claim 36, wherein the maximum 
weighting factor is adjusted downward to a value less than 
the sum of all other nodes. 

38. A storage medium configured to store instructions that 
determine membership of nodes in a cluster in a distributed 
computer system, said instructions comprising: 

assigning a weighting value to each of the nodes; 

grouping a first subset of said nodes into a first possible 
cluster configuration; 

grouping a second subset of said nodes into a second 
possible cluster configuration; 

combining the weighting values of the first subset of said 
nodes to calculate a first value; 

combining the weighting values of the second subset of 
said nodes to calculate a second value; and 

choosing either said first subset or said second subset for 
membership in said cluster depending upon a result of 
said first value of said first possible cluster configuration 
and said second value of said second possible 
cluster configuration. 

39. The storage medium of claim 38, the instructions 
further comprising: 

comparing the weighting values of all nodes in the 
membership in said cluster to find a node with a 
maximum weighting factor; and 

adjusting the maximum weighting factor if the maximum 
weighting factor is greater than or equal to the sum of 
the weighting factors of all other nodes in the membership 
in said cluster. 

40. The storage medium of claim 39, wherein the maximum 
weighting factor is adjusted to a value less than said 
sum of said all other nodes. 

41. The storage medium of claim 38, wherein the weighting 
value assigned to a respective node is indicative of the 
relative processing power of said respective node. 

* * * * * 